This project pursued the development of representations of visual data suitable for control
and decision tasks. The fundamental premise is that traditional notions of information
developed in support of communication engineering - where the task is reproduction of the
source data, and nuisance factors can be easily characterized statistically - are unsuited
to visual inference, where the task is decision or control, and the data formation process
includes scaling (which makes the continuous limit relevant) and occlusion (which makes
control relevant).

Specifically, the task (or class of tasks) determines what portion of the data is “informative”
and what is “nuisance variability.” One of the peculiarities of visual processing is that
most of the complexity in visual data can be ascribed to nuisance factors that affect the
data but are irrelevant to the task. This notion was made precise in [16], before the
commencement of this project, where it was shown that the quotient of the
(infinite-dimensional) set of images modulo changes of viewpoint and illumination is
supported on a set of measure zero of the image domain. So, for any task that requires
invariance to viewpoint and illumination (such as object detection, localization,
recognition, categorization), a zero-measure set contains as much “information” as the
original data. In other words, for decision and control tasks, the information resides in a
thin subset of the visual data.

The development of a theory of information in support of decision and control tasks,
specific to visual data, is a long-term goal that has been pursued since 2007 and will
continue for the foreseeable future. During the course of this project, significant progress
was made in a number of areas critical to such a development, as described below.

Below is a summary of the research accomplishments achieved during the period of
performance, including progress since the last report dated July 31, 2012.

Actionable Information

We have continued progress on occlusion detection, previously reported in [1], which has been
expanded and published in journal form in [2] and has been instrumental in the quantification
of uncertainty in the presence of topological uncertainty due to occlusions. In fact, in the
presence of visibility nuisances, the “innovation” (information increment) is precisely the
complexity of the maximal invariant of the data relative to the nuisances, restricted to the
discovered area of the data domain. We have focused on images, but the results apply to
any remote sensing modality, including lidar, time-of-flight, etc.

While the formal definitions and the general principles cut across sensors, instantiating
a model so that actual values can be computed can be quite challenging. Specifically,
for passive remote sensing such as electro-optical, the simplest model that includes all the
key ingredients (illumination, reflectance, shape, motion, occlusions, etc.) is the so-called
Lambert-Ambient model, which is already rather complex to analyze. In [4] we commenced
an explicit analysis of this model for the purpose of characterizing Actionable
Information. The quotient structure (“shape space”) of this model has been fully
characterized, and the results have been shown to yield an explicit construction of visual
feature descriptors that are - by construction - best suited for the specific task of object
detection. This is another intermediate step in the design of a fully rational visual
inference system that is driven by analytically sound principles as opposed to vague
biological inspiration or trial-and-error practices.

One application of these principles is illustrated in [10], where we designed descriptors
that beat the state of the art on standard benchmark datasets while significantly reducing
the computational load, to the point where they can be implemented on mobile devices
(smartphones) and operate in real time (7-15 FPS).

Inference 

In the early part of this project, we focused on - and to a large extent solved - the problem
of occlusion detection. Occlusion plays a fundamental role in the theory of Actionable
Information, for the maximal viewpoint invariant in the un-occluded region represents the
innovation process, or “information increment,” that comes as a result of a control action.
Therefore, it was very important that efficient and provably optimal occlusion detection
algorithms be developed. Most of the existing literature on the topic was flawed in
fundamental ways. First, most literature on motion estimation characterizes occluding
boundaries as motion discontinuities. This is because such literature is rooted in
variational image processing, where the limit dt → 0 is considered. When the inter-frame
temporal interval goes to zero, clearly there are no occlusions. However, there is no motion
either, so this literature completely neglected occlusions. More recent literature that
tackles occlusion detection directly characterizes occluded regions as those where forward
and backward motion are inconsistent. However, an occluded region is one that is visible in
one image but not the next, and therefore the motion between these regions in the two images
is not just inconsistent: it is simply not defined.
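
For reference, the forward-backward consistency test criticized above can be sketched on a
one-dimensional toy signal as follows. This is a minimal illustration under stated
assumptions, not code from any of the cited works: the function name, the integer flow
representation, and the tolerance parameter are all hypothetical. Note that the forward
displacement at the occluded sample has to be made up, since no true motion exists there,
which is exactly the objection raised in the text.

```python
def forward_backward_inconsistent(forward, backward, tol=0):
    """Flag samples of frame 1 whose forward flow is not undone by the backward flow.

    forward[x]  : integer displacement taking sample x of frame 1 into frame 2
    backward[y] : integer displacement taking sample y of frame 2 into frame 1
    (Both arrays and `tol` are illustrative assumptions.)
    """
    flags = []
    for x, f in enumerate(forward):
        y = x + f
        if not 0 <= y < len(backward):
            flags.append(True)  # sample maps outside the domain of frame 2
            continue
        # Consistent motion satisfies forward[x] + backward[x + forward[x]] == 0.
        flags.append(abs(f + backward[y]) > tol)
    return flags


# Toy scene: samples 0..2 belong to an object moving right by one sample,
# samples 3..4 are static background. Sample 3 is covered by the object in frame 2,
# so forward[3] = 0 is an arbitrary value: the true motion there is undefined.
forward = [1, 1, 1, 0, 0]
backward = [0, -1, -1, -1, 0]
flags = forward_backward_inconsistent(forward, backward)
```

Here only sample 3 is flagged, but the outcome depends on whatever value the flow estimator
happened to assign at the occluded sample.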

We have formalized the problem as a variational (infinite-dimensional) optimization
problem, where one simultaneously estimates the motion that maps one image onto the
next (a diffeomorphism), together with the portion of the domain where such a map is
defined. These are the co-visible regions, and their complement is the (multiply-connected)
occluded region. While this appears to be a very difficult optimization problem, under the
assumptions of Lambertian reflection and constant (or slowly varying) ambient illumination,
we have been able to show that the problem can be framed as a mixed optimization
problem, whose re-weighted relaxation yields a convex optimization problem. This has
enabled us to exploit the recent arsenal of efficient numerical schemes for convex
optimization, including Augmented Lagrangian schemes such as Split-Bregman, as well as
optimal first-order schemes due to Nesterov. We have made our source code publicly
available, and many have been using it since. Furthermore, we have developed a GPU
implementation of the algorithm that exploits its parallelism and achieves real-time
performance [5].
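
To give a feel for what a re-weighted relaxation does, the sketch below applies iteratively
re-weighted l1 shrinkage to a per-pixel brightness residual, with the motion held fixed. It
is a simplified illustration, not the released implementation: the function names, the
parameters `lam` and `eps`, and the choice to treat large, sparse residual entries as
candidate occlusions are assumptions made here, and the joint estimation of the motion
field described above is omitted.

```python
def soft_threshold(value, threshold):
    """Proximal operator of threshold * |x|, evaluated at `value`."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def reweighted_l1_residual_split(residual, lam=0.1, eps=0.01, iterations=5):
    """Extract a sparse component from a residual by re-weighted l1 minimization.

    Each pass solves, in closed form (soft thresholding),
        min_e  0.5 * sum_i (r_i - e_i)^2 + lam * sum_i w_i * |e_i|
    and then updates the weights as w_i = 1 / (|e_i| + eps), so that small
    entries are penalized more heavily on the next pass.
    """
    weights = [1.0] * len(residual)
    sparse = [0.0] * len(residual)
    for _ in range(iterations):
        sparse = [soft_threshold(r, lam * w) for r, w in zip(residual, weights)]
        weights = [1.0 / (abs(e) + eps) for e in sparse]
    # Non-zero sparse entries mark samples the motion model cannot explain.
    occluded = [e != 0.0 for e in sparse]
    return sparse, occluded


# Hypothetical residual: small noise everywhere, two large unexplained entries.
residual = [0.02, -0.03, 0.9, 0.01, -0.8, 0.05]
sparse, occluded = reweighted_l1_residual_split(residual)
```

Each pass is a convex problem with a closed-form solution, which is the property that makes
schemes such as those named above applicable once the motion is included as an unknown.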

The importance of mobility for recognition is further emphasized in [9], which presents
a real-time visual recognition system based on sparse representations that are guided by
Actionable Information. Descriptors for static objects are built from multiple views, taken
while the user moves around an object. In this case the control loop is closed by the
user, and the system assumes that the user acts rationally in collecting images of an object
that approximate a fair sample from the conditional distribution of a maximal invariant to
nuisance factors. Eventually, we expect to design closed-loop robotic systems that perform
such a “visual exploration” automatically.

Semantic  Video  Segmentation 

Occlusion detection is critical to bootstrap the relation between structures in the images of
a video and the topology of the scene. In addition, one may be able to associate regions
of images to “labels,” in which case the notion of Detachable Object - bootstrapped from
occlusion relations - provides a semantic segmentation of the video, where objects are
represented as simply connected regions that back-project onto piecewise continuous surfaces
in space that have consistent labeling. It also provides relations between objects, in the
sense that the scheme does not just attach labels to objects, but also determines whether
there are multiple objects, and in what depth ordering they are presented relative to the
viewer. The key results have been presented in [17], where the ideas have been tested on
benchmark datasets.

Active  Learning  and  Active  Inference 

In order to infer a semantic segmentation of a video, a probability of detection should be
provided at each pixel of each frame, for each possible category label, which is clearly a
tall order. With millions of pixels per frame, hundreds of frames, and tens of
objects, that would mean running object detectors billions of times just to process a
few seconds of video.
Therefore, in [7] we have focused on active inference, where an information-gain criterion
drives the decision as to what frames to process (when), what pixels within each frame
(where), and what detectors to run (what), based on spatio-temporal regularity and context.
While in principle one would not expect that running a subset of detectors on a subset of
locations in a subset of frames could improve performance over the “paragon” setting, we
have found that full exploitation of spatio-temporal regularities and context does indeed
improve performance in the pixel-level detection of multiple classes in video, at a
fraction of the computational cost. This may seem counter to the Data Processing Inequality,
which is one of the driving principles of our work; the conundrum is resolved by the
strong prior structure in natural and man-made scenes. In [7] we have performed tests on
benchmark video segmentation tasks with 19 categories.
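
A generic information-gain criterion of the kind described above can be sketched as follows.
This is an illustration under simplifying assumptions, not the method of [7]: labels and
detector outputs are binary, each detector is summarized by assumed true-positive and
false-positive rates (`tpr`, `fpr`), candidates are scored independently, and the
spatio-temporal regularity and context exploited in [7] are not modeled. All names are
hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def expected_information_gain(prior, tpr, fpr):
    """Mutual information (bits) between a binary label and a detector output.

    prior: current probability that the label is present at this site
    tpr:   assumed probability that the detector fires when the label is present
    fpr:   assumed probability that the detector fires when the label is absent
    """
    p_fire = prior * tpr + (1.0 - prior) * fpr
    gain = binary_entropy(prior)
    # Subtract the expected posterior entropy over the two detector outcomes.
    for p_outcome, p_joint in ((p_fire, prior * tpr),
                               (1.0 - p_fire, prior * (1.0 - tpr))):
        if p_outcome > 0.0:
            gain -= p_outcome * binary_entropy(p_joint / p_outcome)
    return gain


def select_queries(candidates, budget):
    """Return the `budget` candidate names with the largest expected gain.

    candidates: iterable of (name, prior, tpr, fpr) tuples
    """
    ranked = sorted(candidates,
                    key=lambda c: expected_information_gain(c[1], c[2], c[3]),
                    reverse=True)
    return [name for name, _, _, _ in ranked[:budget]]


# Hypothetical (when, where, what) candidates with made-up priors and detector rates.
candidates = [
    ("frame 3, pixel (10, 12), detector 'car'", 0.50, 0.9, 0.1),
    ("frame 3, pixel (40, 7), detector 'car'", 0.99, 0.9, 0.1),
    ("frame 4, pixel (10, 12), detector 'sky'", 0.20, 0.9, 0.1),
]
chosen = select_queries(candidates, budget=2)
```

Sites where the current belief is already nearly certain score low and are skipped, which is
how a fixed computational budget is steered toward the frames, pixels, and detectors that
are expected to be most informative.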

Visual-inertial sensor fusion with application to humanoid motion estimation

While detachable object detection and semantic video segmentation provide topological
relations between surfaces in the scene, and their relation to the viewer, full geometric
modeling of the scene requires the estimation of the Euclidean motion of the sensor
platform. This is a long-standing problem, where the UCLA Vision Lab has played a pioneering
role, from the first observability analysis, to the first demonstration of real-time
estimation of general structure-from-motion, to the first analysis of the observability of
visual-inertial fusion.

During  this  project  we  have  further  pushed  the  envelope  to  include  motion  priors  suitable 
for  human  (or  humanoid)  motions  that  are  not  well  approximated  by  tightly  tuned  random 
walks  [18]. 

Personnel  supported  on  the  project 

See  information  uploaded  on  the  ARO  Report  System. 

Awards  and  honors 

See the list uploaded on the ARO Report System.

Teaching  and  Mentoring 

In addition, material developed during the course of this project has been used in support
of teaching and mentoring. Specifically, the fundamental building blocks and high-level
concepts underlying Visual Information Theory [13] have been used to teach a quarter-long
course at UCLA (CS269: Visual Information Theory), as well as to teach summer courses
at the International Computer Vision Summer School (ICVSS) in Scicli, Italy, in 2011,
2012, and 2013, and at the US-Sino Summer School on Computer Vision, Pattern Recognition and
Machine Learning in Chengdu, China, in 2011. Some of the elements of the theory have been
collected into a book chapter for the lecture notes of ICVSS [14], while [13] is an evolving
manuscript that eventually will be turned into a textbook.

Several students who have been mentored during the course of this project (some supported
directly by this grant, some interacting indirectly with critical personnel but supported
by other sources) have moved on to key positions in industry and academia, including
the University of Oxford, U.K. (Prof. Andrea Vedaldi), Temple University (Dr. Haibin
Ling), KAUST (Dr. Ganesh Sundaramoorthi), Google Inc. (Dr. Alessandro Bissacco, Dr.
Teresa Ko, Dr. Taehee Lee, Dr. Zhao Yi), Amazon Inc. (Dr. Avinash Ravichandran),
Honda Research (Dr. Alper Ayvaci), and Comcast (Dr. Michalis Raptis).

Publications 

The  following  references  describe  work  that  has  been  conducted  during  this  project  and 
acknowledge  support  by  ARO:  Year  1:  [14,  3,  9,  15,  1,  5,  8,  12],  Year  2:  [2,  4,  10,  6,  11,  19], 
Year  3:  [7,  17,  18]. 

Transitions 

While complete transitions have not been accomplished during the period of performance,
since the research focused on fundamental issues underlying the development of
a theory of visual information, the research milestones accomplished enabled improvement
of specific tasks that we envision will result in transitions in the near- to mid-term future.
These include visual-inertial sensor fusion [18] and semantic video segmentation [17].